Pdf Submission Format

Employee Churn Prediction


1. | Introduction 👋

🎳 Problem Statement

As the global economy evolves, employee churn has become a critical challenge for organizations across various industries. The ability to predict employee churn accurately can provide valuable insights for HR departments and management teams to implement proactive measures aimed at retention and talent management. In this project, we aim to leverage various machine learning and deep learning techniques to develop a predictive model for employee churn, ultimately assisting organizations in identifying at-risk employees and devising effective retention strategies.

🤔 Dataset Problems

This dataset is taken from the Kaggle Website. This dataset contains Employee data (i.e., name, age, department, income, etc.) along with information on employechurnn. With the help of machine learning models based on the employee information provided in the dataset, we aim to uncover the factors that lead to employechurnon more deeply in this notebook.

📌 Notebook Objectives

This notebook aims to:
  • Perform dataset exploration using various types of data visualization.
  • Build machine learning model that can predict employee attrition.
  • Export prediction result on test data into files.

👨‍💻 Machine Learning Model

The models used in this notebook:
  1. Logistic Regression,
  2. K-Nearest Neighbour (KNN),
  3. Support Vector Machine (SVM),
  4. Gaussian Naive Bayes,
  5. Decision Tree,
  6. Random Forest,
  7. Extra Tree Classifier,
  8. Gradient Boosting, and
  9. AdaBoost.

2. | Installing and Importing Libraries 📚

Installing and Importing libraries that will be used in this notebook.
In [1]:
# --- Importing Libraries ---
from IPython.display import display, HTML, Javascript
import numpy as np
import pandas as pd
import ydata_profiling
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
import warnings
import os
import yellowbrick
import joblib
import tensorflow as tf
from tensorflow import keras

from ydata_profiling import ProfileReport
from statsmodels.graphics.gofplots import qqplot
from PIL import Image
from highlight_text import fig_text
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, accuracy_score,f1_score, precision_score, recall_score
from yellowbrick.classifier import PrecisionRecallCurve, ROCAUC, ConfusionMatrix
from yellowbrick.model_selection import LearningCurve, FeatureImportances
from yellowbrick.contrib.wrapper import wrap
from yellowbrick.style import set_palette
warnings.filterwarnings("ignore")
WARNING:tensorflow:From C:\Users\Darshan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

3. | Reading Dataset 👓

After importing libraries, the dataset that will be used will be imported.
In [2]:
# --- Importing Dataset ---
df = pd.read_csv("employee.csv")

# --- Reading Train Dataset ---
class Color:
    # Define color codes
    start = '\033[91m'
    end = '\033[0m'
    color = '\033[94m'

# Create an instance of the Color class
clr = Color()

# Reading Train Dataset
print(clr.start + '.: Imported Dataset :.' + clr.end)
print(clr.color + '*' * 23)
styled_df = df.head(10).reset_index(drop=True).style.background_gradient(cmap='Blues').set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', '')]}])
styled_df
.: Imported Dataset :.
***********************
Out[2]:
  Age BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager Attrition
0 41 Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5 1
1 49 Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7 0
2 37 Travel_Rarely 1373 Research & Development 2 2 Other 1 4 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0 1
3 33 Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0 0
4 27 Travel_Rarely 591 Research & Development 2 1 Medical 1 7 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2 0
5 32 Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 8 4 Male 79 3 1 Laboratory Technician 4 Single 3068 11864 0 Y No 13 3 3 80 0 8 2 2 7 7 3 6 0
6 59 Travel_Rarely 1324 Research & Development 3 3 Medical 1 10 3 Female 81 4 1 Laboratory Technician 1 Married 2670 9964 4 Y Yes 20 4 1 80 3 12 3 2 1 0 0 0 0
7 30 Travel_Rarely 1358 Research & Development 24 1 Life Sciences 1 11 4 Male 67 3 1 Laboratory Technician 3 Divorced 2693 13335 1 Y No 22 4 2 80 1 1 2 3 1 0 0 0 0
8 38 Travel_Frequently 216 Research & Development 23 3 Life Sciences 1 12 4 Male 44 2 3 Manufacturing Director 3 Single 9526 8787 0 Y No 21 4 2 80 0 10 2 3 9 7 1 8 0
9 36 Travel_Rarely 1299 Research & Development 27 3 Medical 1 13 3 Male 94 3 2 Healthcare Representative 3 Married 5237 16577 6 Y No 13 3 2 80 2 17 3 2 7 7 7 7 0

Dataset Description🧾

The following is the structure of the dataset.
Variable Name Description Sample Data
Age Employee age (in years) [41, 49, 37]
Attrition Employee attrition status ['Yes','No']
BusinessTravel Frequency of business travel ['Travel_Rarely', 'Travel_Frequently', 'No Travel']
DailyRate Daily rate of pay [102, 279, 1373]
Department Department where employee works ['Sales', 'Research & Development', 'Human Resources']
DistanceFromHome Distance of employee's home from workplace (in miles) [1, 8, 2]
Education Level of education attained [2, 1, 4]
EducationField Field of education ['Life Sciences', 'Other', 'Medical']
HourlyRate Hourly rate of pay [94, 61, 92]
JobInvolvement Level of involvement in job [3, 2, 4, 1]
JobLevel Level of job position [2,1, 3 ]
JobRole Role of employee in the job ['Sales Executive', 'Research Scientist', 'Labora tory T echni cian']
JobSatisfaction Marital status of employee ['Single', ' Married', 'Divorced']
MonthlyRate Monthly rate of pay [19479, 24907, 2396]
NumCompaniesWorked Number of companies worked at previously [8, 1, 6]
OverTime Whether the employee works overtime ['Yes', 'No']
PercentSalaryHike Percentage increase in salary [11, 23, 15]
PerformanceRating Performance rating of employee [3, 4 ]
RelationshipSatisfaction Level of satisfaction with relationships at work [1, 4, 2, 3]
StandardHours Standard hours of work per week [80]
StockOptionLevel Level of stock option available to employee< /td> [0, 3, 2]
TotalWorkingears Total number of years work ed [8, 10]
TrainingTimesLastYear Number of training time s last year [0, 3,2]
WorkLifeBalance Level of work-life balance [1, 3, 2, 4]


4. | Data Preprocessing 🔍

This section will focused on initial data exploration on the dataset with Pandas Profiling before pre-processing performed. In addition, variables correlation will be examined as well.
In [3]:
# --- Dataset Report ---
ProfileReport(df, 
               title="Employee Attrition Prediction",
               minimal=True,
               progress_bar=False,
               samples=None,
               interactions=None,
               explorative=True,
               dark_mode=True,
               notebook={'iframe': {'height': '600px'},
                         'html': {'style': {'primary_color': clr}},
                         'missing_diagrams': {'heatmap': False, 'dendrogram': False}}
              ).to_notebook_iframe()

Some columns can be removed because their values do not affect the analysis results

  • Over18: All values are 'Y'
  • EmployeeCount: All values are 1
  • StandardHours: All values are 80
  • EmployeeNumber: Identifier for employees
  • PercentSalaryHike: less correlated
  • YearsSinceLastPromotion: less correlated
  • DailyRate: Can be calculated from monthly income
  • HourlyRate: Can be calculated from monthly income
  • MonthlyRate: Can be calculated from monthly income
  • PerformanceRating: Can be calculated from monthly income
  • NumCompaniesWorked: Can be calculated from monthly income
  • EducationNuCan be calculated from monthly incomeloyees
In [4]:
df = df.drop(['Over18','EmployeeCount','StandardHours','EmployeeNumber','PercentSalaryHike','YearsSinceLastPromotion'],axis=1)
In [5]:
df = df.drop(['DailyRate' ,'HourlyRate','MonthlyRate','PerformanceRating'],axis=1)
In [6]:
df = df.drop(['NumCompaniesWorked','Education'],axis=1)
In [7]:
# --- Correlation Map Variables ---
palette = ["#4361EE", "#7209B7", "#3A0CA3", "#4CC9F0","#F72585"]
corr = df.corr(numeric_only=True)
colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange', 'cyan', 'magenta']
suptitle = dict(x=0.1, y=1.01, fontsize=26, weight='heavy', ha='left', va='bottom', fontname='arial')
title = dict(x=0.1, y=0.98, fontsize=20, weight='normal', ha='left', va='bottom', fontname='calibri')
xy_label = dict(size=12)
highlight_textprops = [{'weight':'bold', 'color': colors[0]},{'weight':'bold', 'color': colors[1]},{'weight':'bold', 'color': colors[2]}]

# --- Correlation Map (Heatmap) ---
mask = np.triu(np.ones_like(corr, dtype=bool))
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', linewidths=0.2, cbar=False, annot_kws={"size": 10}, rasterized=True)
yticks, ylabels = plt.yticks()
xticks, xlabels = plt.xticks()
ax.set_xticklabels(xlabels, rotation=90, **xy_label)
ax.set_yticklabels(ylabels, **xy_label)
ax.grid(False)
fig_text(s='Numerical Variables Correlation Map', **suptitle)
fig_text(s='<Age, Job Level, and Monthly Income> <negatively correlate> with <target> Attr.', highlight_textprops=highlight_textprops, **title)
plt.tight_layout(rect=[0, 0.08, 1, 1.01])
No description has been provided for this image
From dataset report and correlation matrix, it can be concluded that:
  • There are no missing values detected in the dataset. In addition, it also can be seen that the number of categorical columns is more than the numerical columns.
  • Furthermore, Average Monthly Income is 6502.
  • The mean age of the employee in the dataset was 36 years old, with the maximum being 60 years old and the youngest being 18 years old.

5. | EDA 📈

This section will perform some EDA to get more insights about dataset.
In [8]:
df.hist(figsize=(20, 20), color="#3A0CA3", alpha=0.8)
plt.show()
No description has been provided for this image
In [9]:
ax=sns.boxplot(y=df['MonthlyIncome'],x=df['JobRole'],palette=palette)
plt.setp(ax.get_xticklabels(), rotation=90)
plt.grid(True,alpha=1)
plt.tight_layout()
plt.show()
No description has been provided for this image

Separating Categorical and Numerical Columns

In [10]:
#Categorical Columns
cat_columns = df.select_dtypes(['object']).columns
print(f"Categorical Columns: {cat_columns}")
#Numerical Columns
num_columns = df.select_dtypes(['number']).columns
num_columns = num_columns.drop('Attrition', errors='ignore')
print(f"Numerical Columns: {num_columns}")
Categorical Columns: Index(['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
       'MaritalStatus', 'OverTime'],
      dtype='object')
Numerical Columns: Index(['Age', 'DistanceFromHome', 'EnvironmentSatisfaction', 'JobInvolvement',
       'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsWithCurrManager'],
      dtype='object')
In [11]:
palette = ["#F72585", "#7209B7", "#3A0CA3", "#4361EE", "#4CC9F0"]
sns.set_palette(palette)
for i, col in enumerate(cat_columns):
    
    fig, axes = plt.subplots(1,2,figsize=(13,5))
    ax = sns.countplot(data=df,x=col , ax=axes[0])
    activities = [var for var in df[col].value_counts().sort_index().index]
    ax.set_xticklabels(activities,rotation=90)
    for container in axes[0].containers:
        axes[0].bar_label(container)
            
    index = df[col].value_counts().index
    size = df[col].value_counts().values
    explode = (0.05,0.05)
    
    axes[1].pie(size , labels=index,autopct='%1.1f%%',pctdistance=0.85)
    centre_circle = plt.Circle((0,0),0.70,fc='white')
    fig = plt.gcf()
    
    fig.gca().add_artist(centre_circle)
    plt.suptitle(col,backgroundcolor='green',color='white',fontsize=15)
    
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [12]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 30))
for idx, i in enumerate(num_columns):
    plt.subplot(13, 2, idx + 1)
    sns.boxplot(x=i, data=df,color="#4361EE")
    plt.title(i, color='#3A0CA3', fontsize=20)
    plt.xlabel(i, size=12)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [13]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# EDA 1 Dataframes
df_eda1 = df[['Gender', 'Department', 'Attrition']]
df_eda1 = pd.DataFrame(df_eda1.groupby(['Gender', 'Attrition', 'Department']).size().reset_index(name='total'))
df_eda1['total'] *= np.where(df_eda1['Attrition'] == 0, 1, -1)
df_eda1_m = df_eda1.query('Gender == "Male" & Attrition == 0')
df_eda1_f = df_eda1.query('Gender == "Female" & Attrition == 0')
df_eda1_ms = df_eda1.query('Gender == "Male" & Attrition == 1')
df_eda1_fs = df_eda1.query('Gender == "Female" & Attrition == 1')

# Plotting
fig, ax = plt.subplots(figsize=(9, 5))
bar_mns = plt.barh(np.arange(len(df_eda1_m)), df_eda1_m['total'], color='#F72595', height=0.35, label='Male Not Attributed')
bar_fns = plt.barh(np.arange(len(df_eda1_f)), df_eda1_f['total'], color='#4CC9F0', height=0.35, label='Female Not Attributed')
bar_ms = plt.barh(np.arange(len(df_eda1_ms)) + 0.35, df_eda1_ms['total'], color='#4361EE', height=0.35, label='Male Attributed')
bar_fs = plt.barh(np.arange(len(df_eda1_fs)) + 0.35, df_eda1_fs['total'], color='#7209B7', height=0.35, label='Female Attributed')

ax.set_yticks(np.arange(len(df_eda1.Department.unique())) + 0.35 / 2)
ax.set_yticklabels(df_eda1.Department.unique(), fontsize=7)
plt.xlabel('\nTotal', fontweight='bold', fontsize=8)
plt.ylabel('Department\n', fontweight='bold', fontsize=8)
plt.grid(axis='y', alpha=0, zorder=2)
plt.grid(axis='x', which='major', alpha=0.3, linestyle='dotted', zorder=1)

plt.axvspan(-85, 0, color='#4361EE', alpha=0.2)
plt.axvspan(40, 0, color='#4CC9F0', alpha=0.2)

plt.legend(fontsize=7)
plt.tick_params(bottom='on', length=3, width=1)
ax.spines['bottom'].set_color('black')

plt.suptitle('Attrition Distribution based on Department and Gender', x=0.16, y=0.96, fontsize=13, weight='heavy', ha='left', va='bottom')
plt.title("Attrition rates based on Departments for both male and female employees.", x=0.16, y=0.93, fontsize=8, weight='normal', ha='left', va='bottom')

plt.show()
No description has been provided for this image

💡 Analysis of graphs¶

From dataset report and correlation matrix, it can be concluded that:

  • Attrition is the highest for both men and women from 18 to 35 years of age and gradually decreases.
  • As income increases, attrition decreases.
  • Attrition is much, much less in divorced women.
  • Attrition is higher for employees who usually travel than others, and this rate is higher for women than for men.
  • Attrition is the highest for those in level 1 jobs.
  • Women with the job position of manager, research director, and laboratory technician have almost no attrition.
  • Men with the position of sales expert have more attrition as compare to female.
  • Department of Research & Devlopment has highest attrition in terms both male and female as compared to other deparemnt

6. | Data Preprocessing ⚙️

This section will prepare the dataset before building the machine learning models.

6.1 | Features Separating and Splitting 🪓

In this section, the 'Attrition' (dependent) column will be seperated from independent columns. Also, the dataset will be splitted into 90:10 ratio (90% training and 10% testing).
In [14]:
# --- Seperating Dependent Features ---
X = df.drop(['Attrition'], axis=1)
y = df['Attrition']

# --- Splitting Dataset ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

6.2 | Processing Pipeline 🪠

This section will create a preprocessing pipeline for numerical and categorical columns and apply them to the x_train and x_test data. Not all columns will go through preprocessing. For all numerical columns, scaling will be carried out using a MinMax scaler since the dataset used is a small dataset where the presence of outliers dramatically affects the performance of a model. While for categorical columns with more than two categories, one-hot encoding will be carried out.
In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer

# --- Numerical Pipeline ---
num_pipeline = Pipeline([
    ('scaling', MinMaxScaler())
])

# --- Categorical Pipeline ---
cat_pipeline = Pipeline([
    ('onehot', OneHotEncoder(drop='first', sparse=False))
])

# --- Combine Both Pipelines into Transformer ---
preprocessor = ColumnTransformer([
    ('categorical', cat_pipeline, cat_columns),
    ('numerical', num_pipeline, num_columns)
], remainder='passthrough')

# --- Apply Transformer to Pipeline ---
process_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])
# --- Apply to DataFrame --- 
X_train_process = process_pipeline.fit_transform(X_train)
X_test_process = process_pipeline.fit_transform(X_test)

6.3 | Treating Dataset Imbalance ⚖️

Dataset Imbalance: Only 16% represent churn employees, impacting model accuracy.
Solution: Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE mitigates class imbalance by oversampling the minority class.

It generates synthetic samples, increasing minority class representation.

Benefits:

1. Improved Learning: Better understanding of minority class characteristics.

2. Enhanced Predictions: Reduces bias, improving predictive performance.

Outcome:

1. Balanced Dataset: Improves model generalization and accuracy.

2. Business Impact: Informed decisions on potential churn employees.

Sampling
In [16]:
from imblearn.over_sampling import SMOTE
oversampler = SMOTE(random_state=0)
X_sm_train , y_sm_train = oversampler.fit_resample(X_train_process,y_train)
X_sm_test , y_sm_test = oversampler.fit_resample(X_test_process,y_test)

7. | Machine Learning Model Implementation 🛠️

This section will implement various machine learning models as mentioned in Introduction section. In addition, explanation for each models also will be discussed.
In [17]:
from sklearn.metrics import f1_score, precision_score, recall_score

color_yb = sns.color_palette("Paired")
color_line = 'red'
color = 'red'

def fit_ml_models(algo, algo_param, algo_name):
    # --- Algorithm Pipeline ---
    algo = Pipeline([('algo', algo)])
    
    # --- Apply Grid Search ---
    model = GridSearchCV(algo, param_grid=algo_param, cv=10, n_jobs=-1, verbose=1)
    
    # --- Fitting Model ---
    print(clr.start + f".:. Fitting {algo_name} .:." + clr.end)
    fit_model = model.fit(X_sm_train, y_sm_train)
    
    # --- Model Best Parameters ---
    best_params = model.best_params_
    print("\n>> Best Parameters: " + clr.start + f"{best_params}" + clr.end)
    
    # --- Create Prediction for Train & Test ---
    y_pred_train = model.predict(X_sm_train)
    y_pred_test = model.predict(X_sm_test)

    # --- Calculate F1 score ---
    f1_score_train = f1_score(y_sm_train, y_pred_train)
    f1_score_test = f1_score(y_sm_test, y_pred_test)

    # --- Calculate Precision ---
    precision_train = precision_score(y_sm_train, y_pred_train)
    precision_test = precision_score(y_sm_test, y_pred_test)

    # --- Calculate Recall ---
    recall_train = recall_score(y_sm_train, y_pred_train)
    recall_test = recall_score(y_sm_test, y_pred_test)

    # --- Best & Final Estimators ---
    best_model = model.best_estimator_
    best_estimator = model.best_estimator_._final_estimator
    best_score = round(model.best_score_, 4)
    
    # --- Print Best Score ---
    print(">> Best Score: " + clr.start + "{:.3f}".format(best_score) + clr.end)
    
    # --- Train & Test Accuracy Score ---
    acc_score_train = round(accuracy_score(y_pred_train, y_sm_train) * 100, 3)
    acc_score_test = round(accuracy_score(y_pred_test, y_sm_test) * 100, 3)
    print("\n" + clr.start + f".:. Train and Test Accuracy Score for {algo_name} .:." + clr.end)
    print("\t>> Train Accuracy: " + clr.start + "{:.2f}%".format(acc_score_train) + clr.end)
    print("\t>> Test Accuracy: " + clr.start + "{:.2f}%".format(acc_score_test) + clr.end)
    
    # --- Classification Report ---
    print("\n" + clr.start + f".:. Classification Report for {algo_name} .:." + clr.end)
    print(classification_report(y_sm_test, y_pred_test))
    
    # --- Figures Settings ---
    xy_label = dict(fontweight='bold', fontsize=12)
    grid_style = dict(color=color, linestyle='dotted', zorder=1)
    title_style = dict(fontsize=14, fontweight='bold')
    tick_params = dict(length=3, width=1, color='red')
    bar_style = dict(zorder=3, edgecolor='black', linewidth=0.5, alpha=0.85)
    set_palette(color_yb)
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 14))
    
    # --- Confusion Matrix ---
    conf_matrix = ConfusionMatrix(best_estimator, ax=ax1, cmap='Reds')
    conf_matrix.fit(X_sm_train, y_sm_train)
    conf_matrix.score(X_sm_test, y_sm_test)
    conf_matrix.finalize()
    conf_matrix.ax.set_title('Confusion Matrix\n', **title_style)
    conf_matrix.ax.tick_params(axis='both', labelsize=10, bottom='on', left='on', **tick_params)
    for spine in conf_matrix.ax.spines.values(): spine.set_color(color_line)
    conf_matrix.ax.set_xlabel('\nPredicted Class', **xy_label)
    conf_matrix.ax.set_ylabel('True Class\n', **xy_label)
    conf_matrix.ax.xaxis.set_ticklabels(['False', 'True'], rotation=0)
    conf_matrix.ax.yaxis.set_ticklabels(['True', 'False'])
    
    # --- ROC AUC ---
    logrocauc = ROCAUC(best_estimator, classes=['False', 'True'], ax=ax2, colors=color_yb)
    logrocauc.fit(X_sm_train, y_sm_train)
    logrocauc.score(X_sm_test, y_sm_test)
    logrocauc.finalize()
    logrocauc.ax.set_title('ROC AUC Curve\n', **title_style)
    logrocauc.ax.tick_params(axis='both', labelsize=10, bottom='on', left='on', **tick_params)
    logrocauc.ax.grid(axis='both', alpha=0.4, **grid_style)
    for spine in logrocauc.ax.spines.values(): spine.set_color('None')
    for spine in ['bottom', 'left']:
        logrocauc.ax.spines[spine].set_visible(True)
        logrocauc.ax.spines[spine].set_color(color_line)
    logrocauc.ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.12), ncol=2, borderpad=2, frameon=False, fontsize=10)
    logrocauc.ax.set_xlabel('\nFalse Positive Rate', **xy_label)
    logrocauc.ax.set_ylabel('True Positive Rate\n', **xy_label)
    
    # --- Learning Curve ---
    lcurve = LearningCurve(best_estimator, scoring='f1_weighted', ax=ax3, colors=color_yb)
    lcurve.fit(X_sm_train, y_sm_train)
    lcurve.finalize()
    lcurve.ax.set_title('Learning Curve\n', **title_style)
    lcurve.ax.tick_params(axis='both', labelsize=10, bottom='on', left='on', **tick_params)
    lcurve.ax.grid(axis='both', alpha=0.4, **grid_style)
    for spine in lcurve.ax.spines.values(): spine.set_color('None')
    for spine in ['bottom', 'left']:
        lcurve.ax.spines[spine].set_visible(True)
        lcurve.ax.spines[spine].set_color(color_line)
    lcurve.ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.12), ncol=2, borderpad=2, frameon=False, fontsize=10)
    lcurve.ax.set_xlabel('\nTraining Instances', **xy_label)
    lcurve.ax.set_ylabel('Scores\n', **xy_label)

    try:
        feat_importance = FeatureImportances(best_estimator, labels=columns_list_onehot, ax=ax4, topn=5, colors=color_yb_importance)
        feat_importance.fit(X_sm_train, y_sm_train)
        feat_importance.finalize()
        feat_importance.ax.set_title('Feature Importances (Top 5 Features)\n', **title_style)
        feat_importance.ax.tick_params(axis='both', labelsize=10, bottom='on', left='on', **tick_params)
        feat_importance.ax.grid(axis='x', alpha=0.4, **grid_style)
        feat_importance.ax.grid(axis='y', alpha=0, **grid_style)
        for spine in feat_importance.ax.spines.values(): spine.set_color('None')
        for spine in ['bottom']:
            feat_importance.ax.spines[spine].set_visible(True)
            feat_importance.ax.spines[spine].set_color(color_line)
        feat_importance.ax.set_xlabel('\nRelative Importance', **xy_label)
        feat_importance.ax.set_ylabel('Features\n', **xy_label)
    except:
        prec_curve = PrecisionRecallCurve(best_estimator, ax=ax4, ap_score=True, iso_f1_curves=True)
        prec_curve.fit(X_sm_train, y_sm_train)
        prec_curve.score(X_sm_test, y_sm_test)
        prec_curve.finalize()
        prec_curve.ax.set_title('Precision-Recall Curve\n', **title_style)
        prec_curve.ax.tick_params(axis='both', labelsize=10, bottom='on', left='on', **tick_params)
        for spine in prec_curve.ax.spines.values(): spine.set_color('None')
        for spine in ['bottom', 'left']:
            prec_curve.ax.spines[spine].set_visible(True)
            prec_curve.ax.spines[spine].set_color(color_line)
        prec_curve.ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.12), ncol=2, borderpad=2, frameon=False, fontsize=10)
        prec_curve.ax.set_xlabel('\nRecall', **xy_label)
        prec_curve.ax.set_ylabel('Precision\n', **xy_label)


        
    plt.suptitle(f'\n{algo_name} Performance Evaluation Report\n', fontsize=18, fontweight='bold')
    plt.gcf().text(0.88, 0.02, 'kaggle.com/darshanpathak12', style='italic', fontsize=10)
    plt.tight_layout()
    
    return acc_score_train, acc_score_test, best_score, f1_score_train, f1_score_test, precision_train, precision_test, recall_train, recall_test

7.1 | Logistic Regression

Logistic regression is a statistical method that is used for building machine learning models where the dependent variable is dichotomous: i.e. binary. Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.

The name "logistic regression" is derived from the concept of the logistic function that it uses. The logistic function is also known as the sigmoid function. The value of this logistic function lies between zero and one.

Logistic Regression
In [18]:
# --- Logistic Regression Parameters ---
parameter_lr = {"algo__solver": ["lbfgs", "saga", "newton-cg"]
                , "algo__C": [0.1, 0.2, 0.5, 0.8]}

# --- Logistic Regression Algorithm ---
algo_lr = LogisticRegression(penalty="l2", random_state=42, n_jobs=-1)

# --- Applying Logistic Regression ---
acc_score_train_lr, acc_score_test_lr, best_score_lr,f1_score_train_lr,f1_score_test_lr,precision_train_lr,precision_test_lr,recall_train_lr,recall_test_lr = fit_ml_models(algo_lr, parameter_lr, "Logistic Regression")
.:. Fitting Logistic Regression .:.
Fitting 10 folds for each of 12 candidates, totalling 120 fits

>> Best Parameters: {'algo__C': 0.8, 'algo__solver': 'saga'}
>> Best Score: 0.785

.:. Train and Test Accuracy Score for Logistic Regression .:.
	>> Train Accuracy: 79.31%
	>> Test Accuracy: 67.74%

.:. Classification Report for Logistic Regression .:.
              precision    recall  f1-score   support

           0       0.66      0.74      0.70       124
           1       0.70      0.61      0.66       124

    accuracy                           0.68       248
   macro avg       0.68      0.68      0.68       248
weighted avg       0.68      0.68      0.68       248

No description has been provided for this image

7.2 | K-Nearest Neighbour (KNN)

The k-nearest neighbors (KNN) algorithm is a data classification method for estimating the likelihood that a data point will become a member of one group or another based on what group the data points nearest to it belong to. The k-nearest neighbor algorithm is a type of supervised machine learning algorithm used to solve classification and regression problems.

It's called a lazy learning algorithm or lazy learner because it doesn't perform any training when you supply the training data. Instead, it just stores the data during the training time and doesn't perform any calculations. It doesn't build a model until a query is performed on the dataset. This makes KNN ideal for data mining.

KNN
In [19]:
# --- KNN Parameters ---
parameter_knn = {"algo__n_neighbors": [2, 5, 10, 17]
                , "algo__leaf_size": [1, 10, 11, 30]}

# --- KNN Algorithm ---
algo_knn = KNeighborsClassifier(n_jobs=-1)

# --- Applying KNN ---
acc_score_train_knn, acc_score_test_knn, best_score_knn,f1_score_train_knn,f1_score_test_knn,precision_train_knn,precision_test_knn,recall_train_knn,recall_test_knn = fit_ml_models(algo_knn, parameter_knn, "K-Nearest Neighbour (KNN)")
.:. Fitting K-Nearest Neighbour (KNN) .:.
Fitting 10 folds for each of 16 candidates, totalling 160 fits

>> Best Parameters: {'algo__leaf_size': 1, 'algo__n_neighbors': 2}
>> Best Score: 0.923

.:. Train and Test Accuracy Score for K-Nearest Neighbour (KNN) .:.
	>> Train Accuracy: 100.00%
	>> Test Accuracy: 55.65%

.:. Classification Report for K-Nearest Neighbour (KNN) .:.
              precision    recall  f1-score   support

           0       0.53      0.88      0.66       124
           1       0.66      0.23      0.35       124

    accuracy                           0.56       248
   macro avg       0.60      0.56      0.50       248
weighted avg       0.60      0.56      0.50       248

No description has been provided for this image

7.3 | Support Vector Machine (SVM)

Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
SVM
🖼 SVM by JavaTPoint
In [20]:
# --- SVM Parameters ---
parameter_svc = [
     {'algo__C': [0.6], 'algo__degree': [2], 'algo__kernel': ['poly']}
]

# --- SVM Algorithm ---
algo_svc = SVC(random_state=1, probability=True)

# --- Applying SVM ---
acc_score_train_svc, acc_score_test_svc, best_score_svc,f1_score_train_svc,f1_score_test_svc,precision_train_svc,precision_test_svc,recall_train_svc,recall_test_svc = fit_ml_models(algo_svc, parameter_svc, "Support Vector Machine (SVM)")
.:. Fitting Support Vector Machine (SVM) .:.
Fitting 10 folds for each of 1 candidates, totalling 10 fits

>> Best Parameters: {'algo__C': 0.6, 'algo__degree': 2, 'algo__kernel': 'poly'}
>> Best Score: 0.842

.:. Train and Test Accuracy Score for Support Vector Machine (SVM) .:.
	>> Train Accuracy: 86.97%
	>> Test Accuracy: 68.55%

.:. Classification Report for Support Vector Machine (SVM) .:.
              precision    recall  f1-score   support

           0       0.64      0.85      0.73       124
           1       0.78      0.52      0.62       124

    accuracy                           0.69       248
   macro avg       0.71      0.69      0.68       248
weighted avg       0.71      0.69      0.68       248

No description has been provided for this image

7.4 | Gaussian Naive Bayes

Naive Bayes Classifiers are based on the Bayes Theorem, which one assumption taken is the strong independence assumptions between the features. These classifiers assume that the value of a particular feature is independent of the value of any other feature. In a supervised learning situation, Naive Bayes Classifiers are trained very efficiently. Naive Bayes classifiers need a small training data to estimate the parameters needed for classification. Naive Bayes Classifiers have simple design and implementation and they can applied to many real life situations.

Gaussian Naive Bayes is a variant of Naive Bayes that follows Gaussian normal distribution and supports continuous data. When working with continuous data, an assumption often taken is that the continuous values associated with each class are distributed according to a normal (or Gaussian) distribution.
GNB
🖼 Gaussian Naive Bayes by OpenGenus
In [21]:
# --- Gaussian NB Parameters ---
parameter_gnb = {"algo__var_smoothing": [1e-2, 1e-3, 1e-4, 1e-6]}

# --- Gaussian NB Algorithm ---
algo_gnb = GaussianNB()

# --- Applying Gaussian NB ---
acc_score_train_gnb, acc_score_test_gnb, best_score_gnb,f1_score_train_gnb,f1_score_test_gnb,precision_train_gnb,precision_test_gnb,recall_train_gnb,recall_test_gnb = fit_ml_models(algo_gnb, parameter_gnb, "Gaussian Naive Bayes")
.:. Fitting Gaussian Naive Bayes .:.
Fitting 10 folds for each of 4 candidates, totalling 40 fits

>> Best Parameters: {'algo__var_smoothing': 0.01}
>> Best Score: 0.662

.:. Train and Test Accuracy Score for Gaussian Naive Bayes .:.
	>> Train Accuracy: 66.91%
	>> Test Accuracy: 58.47%

.:. Classification Report for Gaussian Naive Bayes .:.
              precision    recall  f1-score   support

           0       0.61      0.47      0.53       124
           1       0.57      0.70      0.63       124

    accuracy                           0.58       248
   macro avg       0.59      0.58      0.58       248
weighted avg       0.59      0.58      0.58       248

No description has been provided for this image

7.5 | Decision Tree

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
DT
In [22]:
# Decision Tree Parameters
parameter_dt = {
    "algo__max_depth": np.arange(5,15,1),
}

# Decision Tree Algorithm
algo_dt = DecisionTreeClassifier(random_state=42)

# Applying Decision Tree
acc_score_train_dt, acc_score_test_dt, best_score_dt,f1_score_train_dt,f1_score_test_dt,precision_train_dt,precision_test_dt,recall_train_dt,recall_test_dt = fit_ml_models(algo_dt, parameter_dt, "Decision Tree")
.:. Fitting Decision Tree .:.
Fitting 10 folds for each of 10 candidates, totalling 100 fits

>> Best Parameters: {'algo__max_depth': 9}
>> Best Score: 0.881

.:. Train and Test Accuracy Score for Decision Tree .:.
	>> Train Accuracy: 96.26%
	>> Test Accuracy: 78.23%

.:. Classification Report for Decision Tree .:.
              precision    recall  f1-score   support

           0       0.74      0.86      0.80       124
           1       0.84      0.70      0.76       124

    accuracy                           0.78       248
   macro avg       0.79      0.78      0.78       248
weighted avg       0.79      0.78      0.78       248

No description has been provided for this image

7.6 | Random Forest

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
DT
Random Forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
RF
In [23]:
# --- Random Forest Parameters ---
parameter_rf = {"algo__max_depth": np.arange(30,32, 1)}

# --- Random Forest Algorithm ---
algo_rf = RandomForestClassifier(random_state=99, n_jobs=-1)

# --- Applying Random Forest ---
acc_score_train_rf, acc_score_test_rf, best_score_rf,f1_score_train_rf,f1_score_test_rf,precision_train_rf,precision_test_rf,recall_train_rf,recall_test_rf = fit_ml_models(algo_rf, parameter_rf, "Random Forest")
.:. Fitting Random Forest .:.
Fitting 10 folds for each of 2 candidates, totalling 20 fits

>> Best Parameters: {'algo__max_depth': 30}
>> Best Score: 0.939

.:. Train and Test Accuracy Score for Random Forest .:.
	>> Train Accuracy: 100.00%
	>> Test Accuracy: 81.05%

.:. Classification Report for Random Forest .:.
              precision    recall  f1-score   support

           0       0.74      0.95      0.83       124
           1       0.93      0.67      0.78       124

    accuracy                           0.81       248
   macro avg       0.84      0.81      0.81       248
weighted avg       0.84      0.81      0.81       248

No description has been provided for this image

7.7 | Extra Tree Classifier

Extra Trees Classifier is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees collected in a "forest" to output it’s classification result. In concept, it is very similar to a Random Forest Classifier and only differs from it in the manner of construction of the decision trees in the forest.

Each Decision Tree in the Extra Trees Forest is constructed from the original training sample. Then, at each test node, each tree is provided with a random sample of k features from the feature-set from which each decision tree must select the best feature to split the data based on some mathematical criteria (typically the Gini Index). This random sample of features leads to the creation of multiple decisions.
ETC
In [24]:
# --- Extra Tree Parameters ---
parameter_et = {"algo__max_depth": [2, 3]
    , "algo__max_leaf_nodes": [3, 5, 7]}

# --- Extra Tree Algorithm ---
algo_et = ExtraTreesClassifier(random_state=42, n_jobs=-1)

# --- Applying Extra Tree ---
acc_score_train_et, acc_score_test_et, best_score_et,f1_score_train_et,f1_score_test_et,precision_train_et,precision_test_et,recall_train_et,recall_test_et = fit_ml_models(algo_et, parameter_et, "Extra Tree Classifier")
.:. Fitting Extra Tree Classifier .:.
Fitting 10 folds for each of 6 candidates, totalling 60 fits

>> Best Parameters: {'algo__max_depth': 3, 'algo__max_leaf_nodes': 7}
>> Best Score: 0.770

.:. Train and Test Accuracy Score for Extra Tree Classifier .:.
	>> Train Accuracy: 78.09%
	>> Test Accuracy: 64.11%

.:. Classification Report for Extra Tree Classifier .:.
              precision    recall  f1-score   support

           0       0.62      0.73      0.67       124
           1       0.67      0.56      0.61       124

    accuracy                           0.64       248
   macro avg       0.65      0.64      0.64       248
weighted avg       0.65      0.64      0.64       248

No description has been provided for this image

7.8 | Gradient Boosting

Boosting is a method of converting weak learners into strong learners. In boosting, each new tree is a fit on a modified version of the original data set. It strongly relies on the prediction that the next model will reduce prediction errors when blended with previous ones. The main idea is to establish target outcomes for this upcoming model to minimize errors.

Gradient Boosting trains many models in a gradual, additive and sequential manner. The term gradient boosting emerged because every case’s target outcomes are based on the gradient’s error with regards to the predictions. Every model reduces prediction errors by taking a step in the correct direction.
GB
In [25]:
# --- Gradient Boosting Parameters ---
parameter_gb = {
    "algo__learning_rate": [0.1, 0.3, 0.5]
    , "algo__n_estimators": [2, 4, 6]
    , "algo__min_weight_fraction_leaf": [0.1, 0.2, 0.5]
}

# --- Gradient Boosting Algorithm ---
algo_gb = GradientBoostingClassifier(loss="exponential", random_state=2)

# --- Applying Gradient Boosting ---
acc_score_train_gb, acc_score_test_gb, best_score_gb,f1_score_train_gb,f1_score_test_gb,precision_train_gb,precision_test_gb,recall_train_gb,recall_test_gb = fit_ml_models(algo_gb, parameter_gb, "Gradient Boosting")
.:. Fitting Gradient Boosting .:.
Fitting 10 folds for each of 27 candidates, totalling 270 fits

>> Best Parameters: {'algo__learning_rate': 0.5, 'algo__min_weight_fraction_leaf': 0.1, 'algo__n_estimators': 6}
>> Best Score: 0.855

.:. Train and Test Accuracy Score for Gradient Boosting .:.
	>> Train Accuracy: 87.65%
	>> Test Accuracy: 83.47%

.:. Classification Report for Gradient Boosting .:.
              precision    recall  f1-score   support

           0       0.79      0.91      0.85       124
           1       0.90      0.76      0.82       124

    accuracy                           0.83       248
   macro avg       0.84      0.83      0.83       248
weighted avg       0.84      0.83      0.83       248

No description has been provided for this image

7.9 | AdaBoost

AdaBoost also called Adaptive Boosting is a technique in Machine Learning used as an Ensemble Method. The most common algorithm used with AdaBoost is decision trees with one level that means with Decision trees with only 1 split. These trees are also called Decision Stumps. AdaBoost builds a model and gives equal weights to all the data points. It then assigns higher weights to points that are wrongly classified. Now, all the points which have higher weights are given more importance in the next model. It will keep training models until and unless a lowe error is received.
ADABoost
In [26]:
# --- AdaBoost Parameters ---
parameter_ab = {
    "algo__n_estimators": [100,200,300]
    , "algo__learning_rate": [0.1,0.75,0.1]
}

# --- AdaBoost Algorithm ---
algo_ab = AdaBoostClassifier(random_state=1)

# --- Applying AdaBoost ---
acc_score_train_ab, acc_score_test_ab, best_score_ab,f1_score_train_ab,f1_score_test_ab,precision_train_ab,precision_test_ab,recall_train_ab,recall_test_ab = fit_ml_models(algo_ab, parameter_ab, "AdaBoost")
.:. Fitting AdaBoost .:.
Fitting 10 folds for each of 9 candidates, totalling 90 fits

>> Best Parameters: {'algo__learning_rate': 0.75, 'algo__n_estimators': 200}
>> Best Score: 0.906

.:. Train and Test Accuracy Score for AdaBoost .:.
	>> Train Accuracy: 93.91%
	>> Test Accuracy: 89.92%

.:. Classification Report for AdaBoost .:.
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       124
           1       0.92      0.88      0.90       124

    accuracy                           0.90       248
   macro avg       0.90      0.90      0.90       248
weighted avg       0.90      0.90      0.90       248

No description has been provided for this image

7.10 | XgBoost

XGBoost stands for Extreme Gradient Boosting and is a powerful and efficient implementation of the gradient boosting framework. It is widely used for supervised learning tasks, particularly in structured/tabular data problems, and excels in both speed and performance. XGBoost builds a series of decision trees sequentially, with each tree aiming to correct the errors of its predecessor. Unlike traditional gradient boosting methods, XGBoost employs a more regularized model structure, which helps prevent overfitting and enhances generalization. It also incorporates advanced features like parallel computing, tree pruning, and handling missing values, making it a popular choice in machine learning competitions and real-world applications.
XGBoost
In [27]:
# --- XGBoost Parameters ---
import xgboost as xgb
parameter_xgb = {
    "algo__n_estimators": [100],
    "algo__learning_rate": [0.75]
}

# --- XGBoost Algorithm ---
algo_xgb = xgb.XGBClassifier(random_state=1)

# --- Applying XGBoost ---
acc_score_train_xgb, acc_score_test_xgb, best_score_xgb,f1_score_train_xgb,f1_score_test_xgb,precision_train_xgb,precision_test_xgb,recall_train_xgb,recall_test_xgb = fit_ml_models(algo_xgb, parameter_xgb, "XGBoost")
.:. Fitting XGBoost .:.
Fitting 10 folds for each of 1 candidates, totalling 10 fits

>> Best Parameters: {'algo__learning_rate': 0.75, 'algo__n_estimators': 100}
>> Best Score: 0.915

.:. Train and Test Accuracy Score for XGBoost .:.
	>> Train Accuracy: 100.00%
	>> Test Accuracy: 87.90%

.:. Classification Report for XGBoost .:.
              precision    recall  f1-score   support

           0       0.83      0.96      0.89       124
           1       0.95      0.80      0.87       124

    accuracy                           0.88       248
   macro avg       0.89      0.88      0.88       248
weighted avg       0.89      0.88      0.88       248

No description has been provided for this image

7.11 | Model Comparison 👀

After implementing and tuning 9 models, this section will compare all machine learning models accuracy and best score.
In [28]:
# --- Create Accuracy Comparison Table ---
df_compare = pd.DataFrame({
                             'Model': ['Logistic Regression', 'K-Nearest Neighbour', 'Support Vector Machine', 'Gaussian NB',
                                     'Decision Tree', 'Random Forest', 'Extra Tree Classifier', 'Gradient Boosting', 'AdaBoost','XGBoost'] 
                           , 'Accuracy Train': [acc_score_train_lr, acc_score_train_knn, acc_score_train_svc, acc_score_train_gnb,
                                                acc_score_train_dt, acc_score_train_rf, acc_score_train_et, acc_score_train_gb, acc_score_train_ab,acc_score_train_xgb]
                           , 'Accuracy Test': [acc_score_test_lr, acc_score_test_knn, acc_score_test_svc, acc_score_test_gnb,
                                               acc_score_test_dt, acc_score_test_rf, acc_score_test_et, acc_score_test_gb, acc_score_test_ab,acc_score_test_xgb]
                           , 'Best Score': [best_score_lr, best_score_knn, best_score_svc, best_score_gnb,best_score_dt, best_score_rf, 
                                            best_score_et, best_score_gb, best_score_ab,best_score_xgb]
                           , 'F1 Score Train': [f1_score_train_lr, f1_score_train_knn, f1_score_train_svc, f1_score_train_gnb,
                                                f1_score_train_dt, f1_score_train_rf, f1_score_train_et, f1_score_train_gb, f1_score_train_ab,f1_score_train_xgb]
                           , 'F1 Score Test': [f1_score_test_lr, f1_score_test_knn, f1_score_test_svc, f1_score_test_gnb,
                                                f1_score_test_dt, f1_score_test_rf, f1_score_test_et, f1_score_test_gb, f1_score_test_ab,f1_score_test_xgb]
                           , 'Precision Score Train': [precision_train_xgb, precision_train_knn, precision_train_svc, precision_train_gnb,
                                                precision_train_dt, precision_train_rf, precision_train_et, precision_train_gb, precision_train_ab,precision_train_xgb]
                           , 'Precision Score Test': [precision_test_lr, precision_test_knn, precision_test_svc, precision_test_gnb,
                                                precision_test_dt, precision_test_rf, precision_test_et, precision_test_gb, precision_test_ab,precision_test_xgb]
                           , 'Recall Score Train': [recall_train_lr, recall_train_knn, recall_train_svc, recall_train_gnb,
                                                recall_train_dt, recall_train_rf, recall_train_et, recall_train_gb, recall_train_ab,recall_train_xgb]
                           , 'Recall Score Test': [recall_test_lr, recall_test_knn, recall_test_svc, recall_test_gnb,
                                                recall_test_dt, recall_test_rf, recall_test_et, recall_test_gb, recall_test_ab,recall_test_xgb]
                           })

# --- Create Comparison Table ---
print(clr.start+f".:. Models Comparison .:."+clr.end)
print(clr.color+'*' * 26)
df_compare.sort_values(by=['Accuracy Test','F1 Score Test','Best Score'], ascending=False).reset_index(drop=True).style.background_gradient(cmap='Blues').set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', '')]}])
.:. Models Comparison .:.
**************************
Out[28]:
  Model Accuracy Train Accuracy Test Best Score F1 Score Train F1 Score Test Precision Score Train Precision Score Test Recall Score Train Recall Score Test
0 AdaBoost 93.913000 89.919000 0.905800 0.938328 0.897119 0.950926 0.915966 0.926060 0.879032
1 XGBoost 100.000000 87.903000 0.915300 1.000000 0.868421 1.000000 0.951923 1.000000 0.798387
2 Gradient Boosting 87.647000 83.468000 0.855300 0.873500 0.820961 0.894986 0.895238 0.853021 0.758065
3 Random Forest 100.000000 81.048000 0.938700 1.000000 0.779343 1.000000 0.932584 1.000000 0.669355
4 Decision Tree 96.258000 78.226000 0.880600 0.961909 0.763158 0.979439 0.836538 0.944995 0.701613
5 Support Vector Machine 86.970000 68.548000 0.842200 0.873854 0.621359 0.846870 0.780488 0.902615 0.516129
6 Logistic Regression 79.306000 67.742000 0.785400 0.795363 0.655172 1.000000 0.703704 0.804328 0.612903
7 Extra Tree Classifier 78.088000 64.113000 0.770500 0.782648 0.607930 0.776398 0.669903 0.788999 0.556452
8 Gaussian NB 66.907000 58.468000 0.662300 0.715504 0.628159 0.627464 0.568627 0.832281 0.701613
9 K-Nearest Neighbour 100.000000 55.645000 0.923400 1.000000 0.345238 1.000000 0.659091 1.000000 0.233871

8. | Miscellaneous 🧪

This section focuses onbrk>creating a complete pipelinebrk>, starting from data processing to a machine learning pipeline, using the best model concluded in the previous section andbrk>exporting it to joblib and pickle (.pkl) filesbrk>. Besides that,brk>test dataset predicted results would also be exportedbrk> along with actual results in CSV and JSON files. Moreover, this section will alsobrk>make predictions on dummy databrk> (data generated using Python functions) andbrk>export them to CSV and JSON filesbrk>.
In [29]:
# --- Complete Pipeline: Preprocessor & RF ---
xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor)
    , ('algo', xgb.XGBClassifier(random_state=1))
])

# --- Save Complete Pipeline (joblib and pickle) ---
file_name = 'pipeline_employee_churn_xgboost_pathakdarshan'
for ext in ['joblib', 'pkl']:
    joblib.dump(xgb_pipeline, f'Pipeline/{file_name}.{ext}')

9. | Conclusions and Future Improvements 🧐

From the results of dataset analysis and implementation of machine learning models in the previous section, it can be concluded as follows:
  • XGBoost is the best model out of 10 machine-learning models implemented in this notebook. This is because this model fits well with train and test data. In addition, this model also performs better than other models when predicting the test data (can be seen from the performance evaluation graph and classification report of each model).
  • The prediction results on test data and the complete machine learning pipeline have been successfully exported for other purposes. In addition, data exploration has also been successfully carried out using the ydata-profiling, seaborn, and matplotlib libraries.
  • Several improvements can be implemented in the following research/notebook. For example is performing advanced hyperparameter tuning experiments to obtain higher accuracy.

Thank You

¶